YouTube maintains a list of the top trending videos. To determine the year’s top-trending videos, YouTube uses a combination of factors, including measures of user interaction (views, shares, comments, and likes). Our dataset for this project is derived from [1].
Our dataset is a daily record of the top trending YouTube videos, spanning several months (and counting). It covers the US, GB, DE, CA, and FR regions (USA, Great Britain, Germany, Canada, and France, respectively), with up to 200 listed trending videos per day. Each region’s data is in a separate file; we focus on the file for the United States. Kaggle collected this data through the YouTube API.
The data includes several features, such as the video title, trending_date, channel title, publish time, tags, views, likes, dislikes, description, and comment count, with \(23,362\) observations. The variables views, likes, dislikes, and comment count are the numeric variables; the dataset also contains text and time-series data.
In this project we aim to relate the number of views to the other numeric variables through a linear regression model, applying statistical analysis techniques that we learned in the course. This includes identifying the factors that have the most influence on the model. We start by loading the data and preprocessing it (handling missing values, etc.). First, we include all the packages that we need for this project.
library(ggplot2)
library(dplyr)
library(corrplot)
library(lubridate)
As we pointed out, the data is available at [1], but we download it and save it to disk first.
us_videos <- read.csv("USvideos.csv")
us_videos
We start with some initial data processing. We do not need all features of the data, so to save memory we keep only the columns of interest.
We keep the following columns: “video_id”, “trending_date”, “views”, “likes”, “dislikes”, “comment_count”, “category_id”, “publish_time”, “channel_title”, and “title”.
favor_cols <- c("video_id", "trending_date", "views", "likes", "dislikes", "comment_count",
"category_id", "publish_time","channel_title", "title")
us_videos <- us_videos[ , favor_cols]
us_videos$category_id <- as.factor(us_videos$category_id)
us_videos
We convert trending_date and publish_time from factors into a standard date format (trending_date is stored as “yy.dd.mm”, hence ydm()).
us_videos$trending_date <- ydm(us_videos$trending_date)
us_videos$publish_time <- ymd(substr(us_videos$publish_time, start = 1, stop = 10))
us_videos
As the code block below shows, there is no missing value in our data.
sum(is.na(us_videos))
[1] 0
Question: Do we have exactly one observation per video?
Videos are identified by their IDs, so we count the number of distinct IDs by treating video_id as a factor variable. If this number matches the total number of observations, then we have exactly one instance per video.
length(levels(as.factor(us_videos$video_id)))
[1] 4712
Hence, we have only \(4,712\) distinct videos while the number of observations is \(23,362\). So we have multiple rows per video. For instance, for the video whose ID is “2kyS6SvSYSE”, we take a look to see what information is captured across these repeated observations.
us_videos[us_videos$video_id == "2kyS6SvSYSE",]
As this query shows, we have \(7\) observations for this video ID. After a careful look at the data, we realize that there is one row per trending day. This video was trending for seven days, from November 14, 2017 to November 20, 2017, and it was published on November 13, 2017. Thus, each video appears in the data once for every day it has been trending.
We can use the “dplyr” library of R to arrange the observations by their video_id and trending date.
arrange(us_videos, video_id, desc(trending_date))
Are the counts for views, likes, dislikes, and comment_count computed cumulatively, or do they represent the values for each day? Unfortunately, the data source provides no information on this issue. I assume that they are cumulative, since each time a video is watched on YouTube, its view count increases by one.
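One way to sanity-check this assumption is to verify that the counts are non-decreasing within each video's trending history. A minimal sketch on toy data (the video IDs and numbers here are made up; the same pipeline applies to us_videos):

```r
library(dplyr)

# Toy trending history for two hypothetical videos. If counts are cumulative,
# views must be non-decreasing within each video's history.
toy <- data.frame(
  video_id = c("a", "a", "a", "b", "b"),
  trending_date = as.Date(c("2017-11-14", "2017-11-15", "2017-11-16",
                            "2017-11-14", "2017-11-15")),
  views = c(100, 250, 400, 50, 80)
)

check <- toy %>%
  arrange(video_id, trending_date) %>%
  group_by(video_id) %>%
  summarize(cumulative = all(diff(views) >= 0))

# Run on us_videos, any FALSE row would contradict the cumulative assumption.
check
```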
We extract per-video numeric values by aggregating over each video's trending history, so that we have exactly one data instance per video. We call the new dataset “num_feature”.
num_feature<- us_videos %>% group_by(video_id) %>%
summarize(
views = sum(views), likes = max(likes),
dislikes = max(dislikes), comment_count = max(comment_count)
)
num_feature
We compute the mean and standard deviation of views per category_id and draw boxplots of views by category. Unfortunately, there are extreme outliers, and they make it hard to see the details in the plot.
us_videos %>% group_by(category_id) %>%
summarize(mean_views = mean(views), sd_views = sd(views))
categ_data <- us_videos %>% group_by(video_id, category_id) %>%
summarize(
views = sum(views), likes = max(likes),
dislikes = max(dislikes), comment_count = max(comment_count)
)
p <- ggplot(categ_data, aes(x=category_id, y=views, color = category_id) )+ geom_boxplot(outlier.colour="black", outlier.shape=16,
outlier.size=2, notch=FALSE) + coord_flip()
p
We start with the full model, which includes all numeric predictors.
view_full <- lm(views ~ likes + dislikes + comment_count, data = num_feature)
summary(view_full)
Call:
lm(formula = views ~ likes + dislikes + comment_count, data = num_feature)
Residuals:
Min 1Q Median 3Q Max
-1.56e+08 -1.60e+06 -1.19e+06 -2.20e+05 4.60e+08
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.25e+06 2.28e+05 5.5 4.1e-08 ***
likes 1.61e+02 2.37e+00 68.2 < 2e-16 ***
dislikes 3.79e+02 1.18e+01 32.1 < 2e-16 ***
comment_count -4.50e+02 1.75e+01 -25.7 < 2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 1.5e+07 on 4708 degrees of freedom
Multiple R-squared: 0.649, Adjusted R-squared: 0.649
F-statistic: 2.9e+03 on 3 and 4708 DF, p-value: <2e-16
The p-value of the F-statistic is less than \(2.2 \times 10^{-16}\), which means that the regression is significant, i.e., the coefficient of at least one of the predictors “likes”, “dislikes”, “comment_count” is non-zero.
The regression model determined by the least squares method on this dataset is
\[{\bf views} = 1255000 + 161.5 * ({\bf likes}) + 378.9 * ({\bf dislikes}) - 450.4 * ({\bf comment\_count})\] The adj-\(R^2\) is \(0.65\), which is not that high. Moreover, the negative coefficient on comment_count is suspicious and may be a symptom of multicollinearity among the predictors, which inflates the variance of the coefficient estimates. We check for collinearity among the predictors.
cor_matrix <- cor(num_feature[, c("views", "likes", "dislikes", "comment_count")])
corrplot(cor_matrix, method = "number") #order = "hclust"
As the correlation matrix shows, there is high collinearity between some variables. For instance, dislikes and comment count are highly correlated. This makes sense: when people dislike a video, they are also more likely to leave a comment. Similarly, views and likes are highly correlated, so likes seems to be a good predictor for the number of views. Moreover, likes and comment count are notably correlated. Thus, if one of the predictors likes or comment count is in the model, the other might be considered for exclusion. We will examine these dependencies later when modelling.
We calculate the Variance Inflation Factor (VIF) among the predictor variables.
car::vif(view_full)
likes dislikes comment_count
2.4 3.7 6.1
The VIF for “comment_count” is bigger than \(5\). Since values in the range \(5 \leq VIF \leq 10\) are considered problematic, the regression coefficient of the predictor “comment_count” is likely to be poorly estimated.
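The VIF has a simple construction: for predictor \(j\), \(VIF_j = 1/(1 - R_j^2)\), where \(R_j^2\) comes from the auxiliary regression of predictor \(j\) on the remaining predictors. A sketch on simulated data (the variable names are illustrative, not the report's data):

```r
set.seed(1)
n <- 500
x_likes    <- rnorm(n)
x_dislikes <- 0.5 * x_likes + rnorm(n)
# Make the third predictor strongly collinear with the first two.
x_comments <- 0.7 * x_likes + 0.7 * x_dislikes + rnorm(n, sd = 0.5)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j
# on the other predictors.
r2_aux <- summary(lm(x_comments ~ x_likes + x_dislikes))$r.squared
vif_comments <- 1 / (1 - r2_aux)
vif_comments
```

This is exactly what car::vif computes for each predictor of a fitted model.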
We also compute the condition number of the predictor matrix; as a rule of thumb, values well below 30 indicate only moderate collinearity.
kappa(num_feature[,c("likes", "dislikes", "comment_count")])
[1] 15
We start reducing our model by dropping the predictor whose VIF is high. Thus, we drop the “comment_count” from our full model. Then, we evaluate the reduced model for its VIF.
view_like_dislike <- lm(views ~ likes + dislikes, data = num_feature)
car::vif(view_like_dislike)
likes dislikes
1.3 1.3
As the new VIFs make clear, both “likes” and “dislikes” have good (low) VIF values. So, from the collinearity standpoint, we prefer the reduced model over the full model.
Although we no longer see any problematic VIF, we should still compare the performance of the reduced model with the full model in terms of adj-\(R^2\).
summary(view_like_dislike)
Call:
lm(formula = views ~ likes + dislikes, data = num_feature)
Residuals:
Min 1Q Median 3Q Max
-2.35e+08 -1.59e+06 -1.23e+06 -3.28e+05 4.60e+08
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.27e+06 2.44e+05 5.22 1.9e-07 ***
likes 1.20e+02 1.84e+00 64.98 < 2e-16 ***
dislikes 1.32e+02 7.31e+00 18.08 < 2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 1.6e+07 on 4709 degrees of freedom
Multiple R-squared: 0.6, Adjusted R-squared: 0.6
F-statistic: 3.53e+03 on 2 and 4709 DF, p-value: <2e-16
The regression model induced by these predictors is
\[views = 1272000 + 119.9 * (likes) + 132.1* (dislikes)\] The adj-\(R^2\) for the new model is \(0.60\), while for the full model it is \(0.65\). So we are losing some accuracy. If there were no loss in accuracy, the reduced model would clearly be preferred over the full model; given this loss, we should also check other candidates obtained by reducing the model even further.
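The tradeoff between nested models can be formalized with a partial F-test; for the models above that would be anova(view_like_dislike, view_full). A minimal, self-contained illustration on simulated data (all names here are illustrative):

```r
set.seed(42)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 2 + 3 * x1 + 1.5 * x2 + rnorm(n)  # x3 carries no signal here

reduced <- lm(y ~ x1 + x2)
full    <- lm(y ~ x1 + x2 + x3)

# The second row of the ANOVA table tests whether the extra predictor (x3)
# significantly reduces the residual sum of squares.
ftest <- anova(reduced, full)
ftest
```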
view_like <- lm(views ~ likes, data = num_feature)
summary(view_like)
Call:
lm(formula = views ~ likes, data = num_feature)
Residuals:
Min 1Q Median 3Q Max
-2.72e+08 -1.44e+06 -9.91e+05 -1.88e+05 4.60e+08
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.02e+06 2.52e+05 4.04 5.5e-05 ***
likes 1.35e+02 1.70e+00 79.35 < 2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 1.7e+07 on 4710 degrees of freedom
Multiple R-squared: 0.572, Adjusted R-squared: 0.572
F-statistic: 6.3e+03 on 1 and 4710 DF, p-value: <2e-16
Thus, the linear regression model between “views” and “likes” is \(views = 1016000 + 135*(likes)\). As the p-value for the F-statistic shows, the regression is significant. The adj-\(R^2\) for this third model is \(0.57\), which is lower than the adj-\(R^2\) of both the first and second models.
We take a look at the confidence intervals and prediction intervals.
temp_predic <- predict(view_like, interval="prediction")
predictions on current data refer to _future_ responses
new_df <- cbind(num_feature, temp_predic)
ggplot(new_df, aes(likes, views))+
geom_point() +
geom_line(aes(y=lwr), color = "red", linetype = "dashed")+
geom_line(aes(y=upr), color = "red", linetype = "dashed")+
geom_smooth(method=lm, se=TRUE)
We summarize our models in the table below:

Three Linear Regression Models

Model        Predictors                         adj-R^2
1 (full)     likes, dislikes, comment_count     0.649
2 (reduced)  likes, dislikes                    0.600
3 (single)   likes                              0.572
As the table shows, reducing the model costs some accuracy. From the first model to the second, we lose about \(0.05\) in adj-\(R^2\) (from \(0.649\) to \(0.600\)), but in exchange we remove the collinearity from the model. The last model has only one predictor and the lowest adj-\(R^2\). So the best compromise appears to be the second model.
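Information criteria offer another way to rank the three models, since they penalize extra parameters; for the fitted models above one would compare AIC(view_full, view_like_dislike, view_like), where lower is better. A sketch of the idea on simulated data (the generating equation below is invented for illustration):

```r
set.seed(7)
n <- 300
likes    <- rexp(n)
dislikes <- 0.4 * likes + rexp(n)
comments <- 0.6 * dislikes + rexp(n)
# Here views truly depends only on likes and dislikes.
views <- 1000 + 150 * likes + 120 * dislikes + rnorm(n, sd = 50)

m_full <- lm(views ~ likes + dislikes + comments)
m_two  <- lm(views ~ likes + dislikes)
m_one  <- lm(views ~ likes)

aic_tab <- AIC(m_full, m_two, m_one)  # lower AIC = better fit/complexity tradeoff
aic_tab[order(aic_tab$AIC), ]
```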
ggplot(num_feature, aes(x = likes, y = views)) + geom_point()
ggplot(num_feature, aes(x = dislikes, y = views)) + geom_point()
ggplot(num_feature, aes(y = views, x = comment_count)) + geom_point()
As the correlation matrix suggested, there seems to be a linear relation between views and likes, and another between dislikes and comment_count. We investigate each of these single-variable relations by fitting least squares regression models and checking the significance of the regression.
dislike_comment <- lm(comment_count ~ dislikes, data = num_feature)
summary(dislike_comment)
Call:
lm(formula = comment_count ~ dislikes, data = num_feature)
Residuals:
Min 1Q Median 3Q Max
-347616 -3089 -2645 -1204 508732
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.20e+03 2.51e+02 12.8 <2e-16 ***
dislikes 7.13e-01 6.95e-03 102.6 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 17200 on 4710 degrees of freedom
Multiple R-squared: 0.691, Adjusted R-squared: 0.691
F-statistic: 1.05e+04 on 1 and 4710 DF, p-value: <2e-16
Thus, our linear regression model between “comment_count” and “dislikes” is \(comment\_count = 3199 + 0.713\,(dislikes)\). As the p-value for the F-statistic is extremely small, the regression is significant.
temp_predic <- predict(dislike_comment, interval="prediction")
predictions on current data refer to _future_ responses
new_df <- cbind(num_feature, temp_predic)
ggplot(new_df, aes(x = dislikes, y = comment_count))+
geom_point() +
geom_line(aes(y=lwr), color = "red", linetype = "dashed")+
geom_line(aes(y=upr), color = "red", linetype = "dashed")+
geom_smooth(method=lm, se=TRUE)
anova(dislike_comment)
Analysis of Variance Table
Response: comment_count
Df Sum Sq Mean Sq F value Pr(>F)
dislikes 1 3.11e+12 3.11e+12 10527 <2e-16 ***
Residuals 4710 1.39e+12 2.95e+08
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
The plots above show that there are some extreme outliers in our dataset. Influential data points can distort our regression models and unduly sway some of the estimated coefficients. To address this problem, we must identify such points. We start with the leverage points:
The following indices correspond to data points that are leverage points.
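A common rule of thumb flags observation \(i\) as a leverage point when its hat value exceeds twice the average leverage, where \(p\) is the number of model parameters and \(n\) the number of observations:

\[h_{ii} > \frac{2p}{n}\]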
p = 2
n = 4712
inflc = influence(view_full)
inflc$hat[inflc$hat>2*p/n]
18 50 77 79 84 96 102 156 287 315 316 323 355 388 463
0.00236 0.00160 0.05615 0.00271 0.00449 0.00242 0.00209 0.00214 0.00112 0.01263 0.00121 0.00093 0.04849 0.00376 0.00134
489 512 531 539 635 645 664 730 732 736 757 760 771 790 794
0.00105 0.00334 0.00331 0.00091 0.00235 0.00385 0.09073 0.00176 0.00100 0.01537 0.00722 0.00343 0.00243 0.00100 0.03680
852 864 872 877 910 917 932 967 1046 1091 1103 1123 1144 1152 1153
0.00195 0.00129 0.00323 0.00836 0.00109 0.00096 0.00669 0.00721 0.00385 0.00498 0.00124 0.00097 0.00085 0.00093 0.00194
1211 1242 1316 1400 1408 1416 1502 1521 1529 1536 1559 1566 1574 1664 1667
0.00241 0.00954 0.00527 0.04476 0.01235 0.00138 0.00091 0.00085 0.00487 0.00347 0.01777 0.00093 0.00249 0.00175 0.01663
1720 1731 1732 1742 1813 1820 1837 1918 1921 1944 1967 2028 2196 2215 2232
0.00176 0.00188 0.61457 0.00142 0.00166 0.00212 0.00476 0.00119 0.00160 0.00096 0.00360 0.00636 0.00202 0.00103 0.01329
2238 2247 2256 2281 2282 2316 2333 2335 2379 2407 2409 2456 2511 2512 2542
0.00179 0.01336 0.01353 0.01512 0.00088 0.00108 0.00342 0.00349 0.04853 0.00301 0.00091 0.01030 0.18683 0.00163 0.00849
2572 2591 2593 2598 2659 2668 2756 2782 2829 2845 2867 2911 2942 2973 2978
0.00293 0.00742 0.02359 0.00175 0.05088 0.00443 0.02664 0.00090 0.00257 0.00088 0.00890 0.01106 0.00343 0.00188 0.00134
3015 3060 3072 3133 3142 3176 3235 3308 3310 3312 3327 3350 3429 3457 3467
0.00110 0.00177 0.15030 0.11017 0.00464 0.00161 0.00451 0.00115 0.00101 0.00278 0.00136 0.00196 0.00253 0.62693 0.00477
3477 3478 3526 3565 3567 3572 3595 3605 3636 3651 3674 3692 3761 3791 3820
0.00408 0.00295 0.02680 0.00364 0.00154 0.00951 0.00548 0.00088 0.00511 0.00957 0.00154 0.02193 0.00192 0.03384 0.00150
3856 3886 3893 3901 3907 3916 3937 3958 3991 3993 4049 4067 4099 4133 4136
0.00126 0.00131 0.00520 0.00122 0.09218 0.00316 0.00092 0.00087 0.00086 0.00232 0.00104 0.00818 0.00300 0.01615 0.00338
4162 4173 4184 4226 4230 4254 4313 4334 4335 4365 4395 4432 4449 4451 4485
0.01574 0.00123 0.00192 0.00416 0.00166 0.02118 0.01745 0.01160 0.00122 0.00218 0.00189 0.00792 0.00097 0.00194 0.00208
4510 4522 4630 4635 4641 4700
0.00110 0.00138 0.00202 0.00216 0.00754 0.00089
To determine which leverage points are actually influential, we apply Cook's distance.
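Cook's distance combines the size of an observation's residual with its leverage; a common cutoff, used below, flags observation \(i\) as influential when \(D_i > 1\):

\[D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1-h_{ii}}\]

where \(r_i\) is the standardized residual, \(h_{ii}\) the hat value, and \(p\) the number of model parameters.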
cookdis <- cooks.distance(view_full)
cookdis[cookdis>1]
355 664 794 1732 3072 3457 3907
1.2 17.6 1.1 4.2 4.4 9.9 3.4
Thus, the data points with indices below are influential:
\[\text{Influential Point Indices}: 355, 664, 794, 1732, 3072, 3457, 3907\]
So, it might be better to remove them from our data before any modelling.
new_data <- num_feature[-c(355, 664,794,1732,3072,3457,3907), ]
new_lm <- lm(views ~ likes + dislikes + comment_count, data = new_data)
summary(new_lm)
Call:
lm(formula = views ~ likes + dislikes + comment_count, data = new_data)
Residuals:
Min 1Q Median 3Q Max
-1.17e+08 -1.82e+06 -1.49e+06 -4.30e+05 4.59e+08
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.59e+06 2.02e+05 7.84 5.4e-15 ***
likes 1.41e+02 2.75e+00 51.34 < 2e-16 ***
dislikes 3.67e+02 2.42e+01 15.18 < 2e-16 ***
comment_count -3.56e+02 2.03e+01 -17.53 < 2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 1.3e+07 on 4701 degrees of freedom
Multiple R-squared: 0.509, Adjusted R-squared: 0.508
F-statistic: 1.62e+03 on 3 and 4701 DF, p-value: <2e-16
However, removing the influential points decreases the adj-\(R^2\). A likely reason is that other leverage points remain in the data: they are not influential, so they barely move the fitted line, but their large residuals still depress the adj-\(R^2\) in evaluation.
summary(lm(formula = views ~ likes + dislikes , data = new_data))
Call:
lm(formula = views ~ likes + dislikes, data = new_data)
Residuals:
Min 1Q Median 3Q Max
-1.43e+08 -1.87e+06 -1.60e+06 -5.20e+05 4.59e+08
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.68e+06 2.09e+05 8.06 9.9e-16 ***
likes 1.08e+02 2.07e+00 52.21 < 2e-16 ***
dislikes 1.30e+02 2.07e+01 6.28 3.6e-10 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 1.4e+07 on 4702 degrees of freedom
Multiple R-squared: 0.476, Adjusted R-squared: 0.476
F-statistic: 2.14e+03 on 2 and 4702 DF, p-value: <2e-16
It appears that the current predictors cannot yield a strong model relating the number of views to the other variables. So, we engineer new features that may improve the model's accuracy. Based on domain knowledge, we take a careful look at the notion of trending: it suggests a rate of change in views, and trending videos should have a high rate of change in their view counts. Since we know the number of trending days, the first recorded view count, and the last recorded view count, we can approximate this rate as follows:
\[\text{Average Daily Views} = \frac{\text{Last Views Count} - \text{First Views Count}}{\text{Number of Days in Trending}}\] We add this value to our dataset.
options(digits=2)
new_num_feature<- us_videos %>% group_by(video_id) %>%
summarize(
first_trending_date = min(trending_date),
tot_views = sum(views), tot_likes = max(likes),
tot_dislikes = max(dislikes), tot_comment_count = max(comment_count),
trending_days = length(video_id),
avg_daily_views = round((max(views)-min(views))/length(video_id))
)
new_num_feature
ggplot(new_num_feature, aes(y = tot_views, x = avg_daily_views)) + geom_point()
cor_matrix <- cor(new_num_feature[, c("tot_views", "tot_likes", "tot_dislikes", "tot_comment_count", "avg_daily_views")])
corrplot(cor_matrix, method = "number", order = "hclust")
#method = "circle", order = "hclust")
Just like in the previous models, if we let all variables into the model, there is a high VIF (about \(7.9\)) for tot_comment_count.
car::vif(lm(tot_views ~ tot_likes + tot_dislikes + tot_comment_count + avg_daily_views, data = new_num_feature))
tot_likes tot_dislikes tot_comment_count avg_daily_views
5.5 5.4 7.9 3.3
So we remove it and fit a reduced model without it.
new_view_lm <- lm(tot_views ~ tot_likes + tot_dislikes + avg_daily_views, data = new_num_feature)
summary(new_view_lm)
Call:
lm(formula = tot_views ~ tot_likes + tot_dislikes + avg_daily_views,
data = new_num_feature)
Residuals:
Min 1Q Median 3Q Max
-1.82e+08 -7.60e+05 -3.53e+05 2.38e+05 3.31e+08
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.65e+05 1.85e+05 1.97 0.049 *
tot_likes 4.29e+01 1.90e+00 22.51 < 2e-16 ***
tot_dislikes 2.92e+01 5.79e+00 5.04 4.8e-07 ***
avg_daily_views 3.72e+01 6.27e-01 59.36 < 2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 1.2e+07 on 4708 degrees of freedom
Multiple R-squared: 0.771, Adjusted R-squared: 0.771
F-statistic: 5.29e+03 on 3 and 4708 DF, p-value: <2e-16
The adj-\(R^2\) is fairly high and much better than in the previous models, so we settle on this model. Moreover, this model has no high VIF.
car::vif(new_view_lm)
tot_likes tot_dislikes avg_daily_views
2.3 1.4 2.5
We would like to know which categories are the most popular. So we simply count the observations per category and draw a barplot of the counts.
cat_count <- us_videos %>% group_by(category_id) %>% summarize(count = length(category_id))
ggplot(cat_count,aes(x = category_id, y = count,fill=category_id))+geom_bar(stat="identity")
You might be interested in how many days in advance you should upload a video to make it a candidate for trending on a particular day. We answer this question below:
us_videos$days_after_pub <- us_videos$trending_date-us_videos$publish_time
#us_videos[us_videos$days_after_pub<30,]
ggplot(us_videos[us_videos$days_after_pub < 30, ],
       aes(as.factor(days_after_pub), fill = as.factor(days_after_pub))) +
  geom_bar() +
  guides(fill = "none") +
  labs(title = "The Number of Days After Publish Day", subtitle = "Days") +
  xlab(NULL) + ylab(NULL)
We worked with an interesting dataset of trending YouTube videos. This dataset has several features, such as the view, like, dislike, and comment counts, the video title, etc. We focused on the numeric features and tried to find a linear model for the number of views in terms of the other numeric features. Our three original models did not achieve a high adj-\(R^2\), so we started handcrafting new features. The new feature is the average daily views for each video, and it contributed substantially to the model: if we know this value for the first two days of trending, we can approximate the number of views after a given period of time. Choosing a good set of features is often the most significant part of a data analysis problem.
The category titles are available here.
cat_title <- c('People & Blogs', 'Comedy', 'News & Politics', 'Entertainment',
'Sports', 'Autos & Vehicles', 'Gaming', 'Film & Animation',
'Howto & Style', 'Science & Technology', 'Shows', 'Music',
'Travel & Events', 'Education', 'Pets & Animals',
'Nonprofits & Activism')
length(cat_title)
[1] 16